Non-Rationalised Geography NCERT Notes, Solutions and Extra Q & A (Class 6th to 12th)
Chapter 2 Data Processing
Organizing and presenting raw data is the initial step in making it comprehensible and ready for analysis. Various statistical techniques are then used to extract meaningful insights from the data. This chapter introduces key statistical techniques for data analysis in geography.
These techniques are broadly categorized into three types:
1. Measures of Central Tendency
2. Measures of Dispersion
3. Measures of Relationship
Measures of Central Tendency provide a single value that represents the typical or central value of a dataset.
Measures of Dispersion describe how spread out or varied the data points are, often in relation to the central value.
Measures of Relationship (like correlation) quantify the degree of association or interdependence between two or more variables.
Measures Of Central Tendency
Geographical characteristics such as rainfall amounts, elevation, population density, educational attainment levels, or age groups all show variation. To understand these variations collectively, we often seek a single representative value that best summarizes the entire set of observations.
This representative value usually lies near the centre of the data distribution. Statistical methods used to find this central point are called measures of central tendency, also known as statistical averages.
The most common measures of central tendency are the Mean, Median, and Mode. Each provides a different way of identifying a central representative value and is suited to different types of data.
Mean
The mean is the arithmetic average of a dataset. It is calculated by summing all the values in the dataset and dividing the sum by the total number of observations. The method of calculating the mean differs slightly for ungrouped and grouped data, and can be done using either direct or indirect methods.
Computing Mean from Ungrouped Data:
- Direct Method: Sum all individual values ($\sum x$) and divide by the number of observations (N).
$ \text{Mean} (\bar{X}) = \frac{\sum x}{N} $
Example 2.1: Calculate the mean rainfall for Malwa Plateau districts from the rainfall data given.
Answer:
Rainfall data (x) for 7 districts: 979, 1083, 833, 896, 891, 825, 977 mm.
Sum of rainfall ($\sum x$) = $979 + 1083 + 833 + 896 + 891 + 825 + 977 = 6484$ mm.
Number of districts (N) = 7.
$ \bar{X} = \frac{6484}{7} = 926.29 \text{ mm} $
- Indirect Method: Used for larger datasets to simplify calculations. An 'assumed mean' (A) is chosen (ideally close to the actual mean). Deviations (d) of each observation from the assumed mean ($d = x - A$) are calculated. The mean is then calculated using the formula:
$ \text{Mean} (\bar{X}) = A + \frac{\sum d}{N} $
Example 2.1 (continued): Calculate the mean rainfall using an assumed mean of 800 mm.
Answer:
Assumed Mean (A) = 800.
Deviations (d = x - 800): 179, 283, 33, 96, 91, 25, 177.
Sum of deviations ($\sum d$) = $179 + 283 + 33 + 96 + 91 + 25 + 177 = 884$.
Number of districts (N) = 7.
$ \bar{X} = 800 + \frac{884}{7} = 800 + 126.29 = 926.29 \text{ mm} $
Districts in Malwa Plateau | Normal Rainfall (x) in mm | Deviation (d = x - 800) |
---|---|---|
Indore | 979 | 179 |
Dewas | 1083 | 283 |
Dhar | 833 | 33 |
Ratlam | 896 | 96 |
Ujjain | 891 | 91 |
Mandsaur | 825 | 25 |
Shajapur | 977 | 177 |
$\sum x$ and $\sum d$ | 6484 | 884 |
$\bar{X} = \sum x / N$ and $\sum d / N$ | 926.29 | 126.29 |

The mean calculated by both methods is the same.
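The two methods for ungrouped data can be checked with a short Python sketch. It uses only the rainfall figures from Example 2.1; the variable names are illustrative and not part of the chapter.

```python
# Normal rainfall (mm) for the seven Malwa Plateau districts in Example 2.1.
rainfall = [979, 1083, 833, 896, 891, 825, 977]
n = len(rainfall)

# Direct method: sum of all values divided by the number of observations.
mean_direct = sum(rainfall) / n

# Indirect method: take deviations from an assumed mean A,
# then add the average deviation back to A.
assumed_mean = 800
deviations = [x - assumed_mean for x in rainfall]
mean_indirect = assumed_mean + sum(deviations) / n

print(round(mean_direct, 2), round(mean_indirect, 2))
```

Any assumed mean gives the same result; a value close to the actual mean simply keeps the deviations small when working by hand.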
Computing Mean from Grouped Data:
- Direct Method: When data is grouped into classes with frequencies, the midpoint (x) of each class represents the values in that class. Calculate the product of the midpoint and frequency (fx) for each class. Sum all these products ($\sum fx$) and divide by the total number of observations (N, which is the sum of frequencies, $\sum f$).
$ \text{Mean} (\bar{X}) = \frac{\sum fx}{N} $
Example 2.2: Compute the average wage rate of factory workers using the given data (Table 2.2).
Wage Rate (Rs./day) Classes | Number of workers (f) |
---|---|
50 - 70 | 10 |
70 - 90 | 20 |
90 - 110 | 25 |
110 - 130 | 35 |
130 - 150 | 9 |

Answer:
Calculate midpoints (x) for each class and the product (fx).
Classes | Frequency (f) | Midpoints (x) | fx |
---|---|---|---|
50-70 | 10 | 60 | 600 |
70-90 | 20 | 80 | 1600 |
90-110 | 25 | 100 | 2500 |
110-130 | 35 | 120 | 4200 |
130-150 | 9 | 140 | 1260 |
Total | N = $\sum f = 99$ | | $\sum fx = 10160$ |

$ \bar{X} = \frac{10160}{99} = 102.6 $ Rs./day
- Indirect Method (Short-cut Method): An assumed mean (A) is chosen (often the midpoint of the class with the highest frequency or near the middle of the range). Calculate the deviation (d) of the midpoint of each class from the assumed mean ($d = x - A$), keeping its sign. Multiply each deviation by its frequency (fd). Sum these products ($\sum fd$).
$ \bar{X} = A + \frac{\sum fd}{N} $
Because $\sum fd$ carries its own sign, the correction is subtracted automatically when the assumed mean is larger than the actual mean.
Alternatively, a simplified deviation (u) can be used, dividing d by the class interval (i): $u = d/i$. Then multiply u by frequency (fu), sum these products ($\sum fu$).
$ \bar{X} = A + \frac{\sum fu}{N} \times i $
Example 2.2 (continued): Compute the average wage rate using the indirect method, with an assumed mean of 100 (midpoint of 90-110 class) and interval of 20.
Answer:
Assumed Mean (A) = 100, Interval (i) = 20.
Classes | Frequency (f) | Midpoints (x) | Deviation (d = x - 100) | fd | Simplified Deviation (u = d/20) | fu |
---|---|---|---|---|---|---|
50-70 | 10 | 60 | -40 | -400 | -2 | -20 |
70-90 | 20 | 80 | -20 | -400 | -1 | -20 |
90-110 | 25 | 100 | 0 | 0 | 0 | 0 |
110-130 | 35 | 120 | 20 | 700 | 1 | 35 |
130-150 | 9 | 140 | 40 | 360 | 2 | 18 |
Total | N = $\sum f = 99$ | | | $\sum fd = 260$ | | $\sum fu = 13$ |

Using $\sum fd$:
$ \bar{X} = 100 + \frac{260}{99} = 100 + 2.63 = 102.63 $ Rs./day (the same value as the direct method, which was rounded to one decimal place as 102.6)
Using $\sum fu$:
$ \bar{X} = 100 + \frac{13}{99} \times 20 = 100 + 0.1313 \times 20 = 100 + 2.63 = 102.63 $ Rs./day
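The grouped-data calculation in Example 2.2 can be reproduced with the sketch below. It assumes equal-width classes represented by their midpoints, as in the text; the variable names are illustrative.

```python
# Wage classes (Rs./day) and number of workers from Example 2.2.
classes = [(50, 70), (70, 90), (90, 110), (110, 130), (130, 150)]
freqs = [10, 20, 25, 35, 9]

midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(freqs)

# Direct method: sum of f*x divided by N.
mean_direct = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Indirect method with simplified deviations u = (x - A) / i.
assumed_mean = 100
interval = 20
u = [(x - assumed_mean) / interval for x in midpoints]
sum_fu = sum(f * ui for f, ui in zip(freqs, u))
mean_indirect = assumed_mean + sum_fu / n * interval

print(n, sum_fu, round(mean_direct, 2), round(mean_indirect, 2))
```

Both routes give about 102.63 Rs./day, since the indirect method only rearranges the same sum.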
Median
The median is a positional average. It is the value that divides a dataset, when arranged in ascending or descending order, into two equal halves. It is not affected by the actual values of extreme observations, only by their position.
Computing Median for Ungrouped Data:
Arrange the data in ascending or descending order. The median is the value of the middle observation. The position of the median is found using the formula:
$ \text{Position of Median} = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ item} $
Where N is the number of observations.
If N is odd, the median is the value at this position. If N is even, the median is the average of the values at the two middle positions (N/2 and (N/2)+1).
Example 2.3: Calculate median height for the given mountain peaks: 8,126 m, 8,611 m, 7,817 m, 8,172 m, 8,076 m, 8,848 m, 8,598 m.
Answer:
Arrange in ascending order: 7,817; 8,076; 8,126; 8,172; 8,598; 8,611; 8,848.
N = 7.
Position of Median = $(7+1)/2 = 4^{\text{th}}$ item.
The 4th item in the arranged series is 8,172 m.
$ \text{Median} (M) = 8,172 \text{ m} $
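A minimal sketch of the positional rule, covering both the odd and the even case described above. The function name `median_ungrouped` is hypothetical.

```python
def median_ungrouped(values):
    """Return the middle value of the sorted data (average of the two
    middle values when the number of observations is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)-th item, which is index mid (0-based).
        return ordered[mid]
    # Even N: average of the (N/2)-th and (N/2 + 1)-th items.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Peak heights (m) from Example 2.3.
peaks = [8126, 8611, 7817, 8172, 8076, 8848, 8598]
median_height = median_ungrouped(peaks)
print(median_height)
```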
Computing Median for Grouped Data:
For grouped data, the median is calculated using the cumulative frequency distribution to find the class where the median lies (the median class). The formula is:
$ M = l + \frac{\frac{N}{2} - c}{f} \times i $
Where:
- M = Median
- l = Lower limit of the median class
- N = Total frequency ($\sum f$)
- c = Cumulative frequency of the class *preceding* the median class
- f = Frequency of the median class
- i = Class interval width
Example 2.4: Calculate the median for the following frequency distribution:
Class | f |
---|---|
50-60 | 3 |
60-70 | 7 |
70-80 | 11 |
80-90 | 16 |
90-100 | 8 |
100-110 | 5 |
Answer:
Calculate cumulative frequencies (F) and find the median position (N/2).
Class | Frequency (f) | Cumulative Frequency (F) | Calculation of Median Class |
---|---|---|---|
50-60 | 3 | 3 | |
60-70 | 7 | 10 | |
70-80 | 11 | 21 (c) | |
80-90 | 16 (f) | 37 | Median group (N/2 = 25 is here) |
90-100 | 8 | 45 | |
100-110 | 5 | 50 | |
Total | N = $\sum f = 50$ | | |
$ N/2 = 50/2 = 25 $. The cumulative frequency next greater than 25 is 37, which falls in the 80-90 class. So, the median class is 80-90.
l = 80, N = 50, c = 21 (cumulative frequency of the class before 80-90), f = 16 (frequency of 80-90 class), i = 10 (class interval width).
$ M = 80 + \frac{25 - 21}{16} \times 10 = 80 + \frac{4}{16} \times 10 = 80 + \frac{1}{4} \times 10 = 80 + 2.5 = 82.5 $
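The grouped-median formula can be sketched as a small function. It assumes contiguous classes of equal width, listed in ascending order; the function name `grouped_median` and its parameters are illustrative, not from the chapter.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of a frequency distribution: M = l + ((N/2 - c) / f) * i."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("distribution has no observations")
    half = n / 2
    cumulative = 0  # c: cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / f * width
        cumulative += f


# Distribution from Example 2.4.
lower_limits = [50, 60, 70, 80, 90, 100]
freqs = [3, 7, 11, 16, 8, 5]
median_value = grouped_median(lower_limits, freqs, width=10)
print(median_value)
```

The interpolation treats the observations as evenly spread within the median class, which is the assumption built into the formula.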
Mode
The mode is the value that appears most frequently in a dataset. It is represented by $Z$ or $M_0$. The mode is used less often than the mean or median.
Computing Mode for Ungrouped Data:
For ungrouped data, arrange the measures in ascending or descending order and simply count the frequency of each value to identify the one that occurs most often.
Example 2.5: Calculate mode for test scores: 61, 10, 88, 37, 61, 72, 55, 61, 46, 22.
Answer:
Arrange in ascending order: 10, 22, 37, 46, 55, 61, 61, 61, 72, 88.
The score 61 occurs 3 times, more than any other score.
$ \text{Mode} (Z) = 61 $ (Unimodal - one mode)
Example 2.6: Calculate mode for test scores: 82, 11, 57, 82, 08, 11, 82, 95, 41, 11.
Answer:
Arrange in ascending order: 08, 11, 11, 11, 41, 57, 82, 82, 82, 95.
The scores 11 and 82 both occur 3 times, which is the highest frequency.
$ \text{Mode} (Z) = 11 \text{ and } 82 $ (Bimodal - two modes)
If three values have the same highest frequency, the distribution is trimodal. If many values have the same highest frequency, it's multimodal. If no value is repeated, there is no mode.
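Counting frequencies is easy to automate. The sketch below uses `collections.Counter` from the Python standard library and returns every value that shares the highest frequency, or an empty list when no value repeats. The function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, in ascending order.
    An empty list means there is no mode (no value is repeated)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


scores_a = [61, 10, 88, 37, 61, 72, 55, 61, 46, 22]  # Example 2.5
scores_b = [82, 11, 57, 82, 8, 11, 82, 95, 41, 11]   # Example 2.6

print(modes(scores_a))  # unimodal
print(modes(scores_b))  # bimodal
```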
Comparison Of Mean, Median And Mode
The relationship between mean, median, and mode can be visualized using a frequency distribution curve.
In a normal distribution, the frequency distribution is symmetrical and bell-shaped. In a perfect normal distribution, the mean, median, and mode all coincide and are located at the peak of the curve, representing the central value with the highest frequency.
However, if the data distribution is not symmetrical but skewed (pushed towards one end), the mean, median, and mode will not coincide.
- Positive Skew (Right Skew): The tail of the distribution extends towards higher values. The mode is at the peak, the median is to the right of the mode, and the mean is further to the right (pulled by the higher values). Mode < Median < Mean.
- Negative Skew (Left Skew): The tail extends towards lower values. The mode is at the peak, the median is to the left of the mode, and the mean is further to the left (pulled by the lower values). Mean < Median < Mode.
The choice of which measure of central tendency to use depends on the data type and distribution. The mean is sensitive to extreme values. The median is less affected by extreme values and is suitable for skewed distributions. The mode is useful for categorical data or identifying the most common value, but can be unstable and may not exist or be unique.
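As a rough illustration of how a single extreme value pulls the mean but barely moves the median, the sketch below uses ten daily wage figures (the same values as Example 2.7) with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

# Daily wages (Rs.); the last value is far above the rest.
wages = [40, 42, 45, 48, 50, 52, 55, 58, 60, 100]

mean_all = statistics.mean(wages)
median_all = statistics.median(wages)

# The same data with the extreme value removed.
trimmed = wages[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_all, median_all)          # mean lies above the median
print(mean_trimmed, median_trimmed)  # mean and median coincide
```

With the high value included the mean exceeds the median, which is the ordering expected for a positively skewed distribution.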
Measures Of Dispersion
Measures of central tendency alone do not fully describe a dataset. They tell us the centre but not how the data points are spread out around that centre. Dispersion (or variability) refers to the scattering or spread of scores or measurements within a distribution.
Using measures of dispersion alongside central tendency provides a better understanding of the distribution's characteristics, such as its homogeneity or variability.
Dispersion serves two main purposes: understanding the composition of a distribution and comparing the stability or homogeneity of different distributions.
Common methods for measuring dispersion are:
- Range
- Quartile Deviation
- Mean Deviation
- Standard Deviation
- Coefficient of Variation
- Lorenz Curve
The Range, Standard Deviation (as an absolute measure), and Coefficient of Variation (as a relative measure) are widely used. Quartile Deviation and Mean Deviation are less common.
Range
The range (R) is the simplest measure of dispersion, calculated as the difference between the highest (L) and lowest (S) values in a dataset.
$ R = L - S $
Example 2.7: Calculate the range for daily wages: Rs. 40, 42, 45, 48, 50, 52, 55, 58, 60, 100.
Answer:
Highest value (L) = 100, Lowest value (S) = 40.
$ R = 100 - 40 = 60 $
The range is highly influenced by extreme values and is considered an unstable measure of dispersion, similar to how the mode is an unstable measure of central tendency.
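The instability of the range can be seen by recomputing it for Example 2.7 with the highest wage left out. This is a small sketch; `value_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range R = L - S (highest value minus lowest value)."""
    return max(values) - min(values)


wages = [40, 42, 45, 48, 50, 52, 55, 58, 60, 100]

range_all = value_range(wages)               # includes the extreme value
range_without_top = value_range(wages[:-1])  # extreme value removed

print(range_all, range_without_top)
```

Dropping one observation out of ten cuts the range from 60 to 20, although the other nine wages are unchanged.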
Standard Deviation
The standard deviation (SD) is the most common and stable measure of dispersion. It is calculated around the mean and represents the typical distance of data points from the mean. It is defined as the square root of the variance.
The Greek letter $\sigma$ (sigma) often denotes Standard Deviation for a population, while 's' or SD is used for a sample.
The formula for Standard Deviation for ungrouped data is:
$ s = \sqrt{\frac{\sum x^2}{N}} $
Where $x$ is the deviation of each score from the mean ($x = X - \bar{X}$) and $x^2$ is the squared deviation.
The term $\frac{\sum x^2}{N}$ before taking the square root is called the variance ($s^2$). Standard deviation is the square root of variance, and variance is the square of standard deviation.
Computing Standard Deviation for Ungrouped Data:
Example 2.8: Calculate the standard deviation for scores: 01, 03, 05, 07, 09.
Answer:
First, calculate the mean ($\bar{X}$).
$ \bar{X} = (1+3+5+7+9)/5 = 25/5 = 5 $
Calculate deviations from the mean (x) and squared deviations (x$^2$).
X (Score) | $x = X - \bar{X}$ (Deviation from Mean) | $x^2$ (Squared Deviation) |
---|---|---|
1 | $1 - 5 = -4$ | $(-4)^2 = 16$ |
3 | $3 - 5 = -2$ | $(-2)^2 = 4$ |
5 | $5 - 5 = 0$ | $(0)^2 = 0$ |
7 | $7 - 5 = 2$ | $(2)^2 = 4$ |
9 | $9 - 5 = 4$ | $(4)^2 = 16$ |
$\sum X = 25$ | $\sum x = 0$ (Check: sum of deviations is zero) | $\sum x^2 = 40$ |
N = 5.
$ s = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{40}{5}} = \sqrt{8} \approx 2.83 $
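The steps of Example 2.8 translate directly into code. The sketch below divides by N, as the formula in this chapter does, and cross-checks the result against `statistics.pstdev`, the standard-library function for the population standard deviation.

```python
import math
import statistics

scores = [1, 3, 5, 7, 9]  # Example 2.8
n = len(scores)

mean = sum(scores) / n
squared_devs = [(x - mean) ** 2 for x in scores]  # x^2 column of the table

variance = sum(squared_devs) / n  # divide by N, as in the chapter's formula
sd = math.sqrt(variance)

print(variance, round(sd, 2))
# Cross-check with the standard library (also divides by N).
print(math.isclose(sd, statistics.pstdev(scores)))
```

Note that `statistics.stdev` divides by N - 1 instead and would give a slightly larger value for the same data.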
Computing Standard Deviation for Grouped Data:
For grouped data, a simplified calculation method similar to the indirect method for mean is often used.
$ s = i \times \sqrt{\frac{\sum fu^2}{N} - \left(\frac{\sum fu}{N}\right)^2} $
Where:
- s = Standard Deviation
- i = Class interval width
- f = Frequency of each class
- u = Simplified deviation of the midpoint of each class from the assumed mean (u = (midpoint - Assumed Mean) / i)
- $fu^2$ = product of frequency and squared simplified deviation
- N = Total frequency ($\sum f$)
- $\sum fu$ = Sum of (frequency * simplified deviation)
- $\sum fu^2$ = Sum of (frequency * squared simplified deviation)
Example: Calculate the standard deviation for the following distribution:
Groups | f |
---|---|
120-130 | 2 |
130-140 | 4 |
140-150 | 6 |
150-160 | 12 |
160-170 | 10 |
170-180 | 6 |
Answer:
Calculate midpoints, choose an assumed mean, calculate simplified deviations (u), fu, and fu$^2$. Assumed mean (A) = 155 (midpoint of 150-160), interval (i) = 10.
Group | f | Midpoint (x) | $u = (x - 155) / 10$ | fu | $u^2$ | $fu^2$ |
---|---|---|---|---|---|---|
120 - 130 | 2 | 125 | -3 | -6 | 9 | 18 |
130 - 140 | 4 | 135 | -2 | -8 | 4 | 16 |
140 - 150 | 6 | 145 | -1 | -6 | 1 | 6 |
150 - 160 | 12 | 155 | 0 | 0 | 0 | 0 |
160 - 170 | 10 | 165 | 1 | 10 | 1 | 10 |
170 - 180 | 6 | 175 | 2 | 12 | 4 | 24 |
Total | N = 40 | | | $\sum fu = 2$ | | $\sum fu^2 = 74$ |
$ s = 10 \times \sqrt{\frac{74}{40} - \left(\frac{2}{40}\right)^2} = 10 \times \sqrt{1.85 - (0.05)^2} = 10 \times \sqrt{1.85 - 0.0025} = 10 \times \sqrt{1.8475} \approx 10 \times 1.359 = 13.59 $
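The short-cut formula for grouped data can be verified with the sketch below, which rebuilds the fu and fu² columns from the class limits and frequencies. It assumes equal class widths; the variable names are illustrative.

```python
import math

# Lower class limits and frequencies from the worked example.
lower_limits = [120, 130, 140, 150, 160, 170]
freqs = [2, 4, 6, 12, 10, 6]
width = 10
assumed_mean = 155  # midpoint of the 150-160 class

midpoints = [lower + width / 2 for lower in lower_limits]
u = [(x - assumed_mean) / width for x in midpoints]

n = sum(freqs)
sum_fu = sum(f * ui for f, ui in zip(freqs, u))
sum_fu2 = sum(f * ui ** 2 for f, ui in zip(freqs, u))

# s = i * sqrt(sum(f u^2)/N - (sum(f u)/N)^2)
sd = width * math.sqrt(sum_fu2 / n - (sum_fu / n) ** 2)

print(n, sum_fu, sum_fu2, round(sd, 2))
```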
Coefficient Of Variation (CV)
The Coefficient of Variation (CV) is a relative measure of dispersion. It is particularly useful for comparing the variability of datasets that are expressed in different units of measurement or have vastly different means. CV expresses the standard deviation as a percentage of the mean.
$ \text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100 $
$ \text{CV} = \frac{s}{\bar{X}} \times 100 $
Example: Calculate the CV for the dataset in Example 2.8 ($\bar{X} = 5$, $s \approx 2.83$).
Answer:
$ \text{CV} = \frac{2.83}{5} \times 100 = 0.566 \times 100 = 56.6\% $
A higher CV indicates greater relative variability or dispersion compared to the mean.
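To show how the CV allows comparison across datasets, the sketch below applies the formula to Example 2.8 and to the grouped distribution from the standard deviation example. The mean of 155.5 for the grouped data is not stated in the text; it is derived here from the assumed-mean formula as 155 + (2/40) × 10. The function name is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# Example 2.8: mean 5, SD about 2.83.
cv_scores = coefficient_of_variation(2.83, 5)

# Grouped example: SD about 13.59; mean 155.5 derived as noted above.
cv_grouped = coefficient_of_variation(13.59, 155.5)

print(round(cv_scores, 1), round(cv_grouped, 2))
```

The grouped data have the larger standard deviation in absolute terms but the much smaller CV, because their spread is small relative to their mean.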
Measures Of Relationship
Measures of relationship explore the association or interdependence between two or more variables. When changes in one variable are associated with changes in another, we say they are related or correlated. Correlation is a measure of this relationship.
Correlation describes both the nature (direction) and strength (degree) of the relationship between variables.
Direction Of Correlation
The direction of correlation indicates whether variables change together in the same direction or opposite directions.
- Positive Correlation: Variables change in the same direction (as one increases, the other increases; as one decreases, the other decreases). Example: fertilizer consumption and crop yield (often positively correlated).
- Negative Correlation: Variables change in opposite directions (as one increases, the other decreases). Example: altitude and air pressure (negatively correlated).
- No Correlation (Zero Correlation): Changes in one variable do not correspond to any consistent change in the other variable.
A scatter plot visually shows the relationship between two variables. In a scatter plot, if points tend to rise from lower left to upper right, it indicates positive correlation. If points tend to fall from upper left to lower right, it indicates negative correlation. If points are scattered randomly with no clear pattern, it indicates no correlation.
Degree Of Correlation
The degree or strength of correlation measures how closely the two variables are related. It is expressed numerically by the correlation coefficient, which always lies between -1.00 and +1.00 and can never exceed 1 in either direction.
- Perfect Positive Correlation (+1.00): All data points fall exactly on a straight line that slopes upwards from left to right. There is a perfect, direct relationship.
- Perfect Negative Correlation (-1.00): All data points fall exactly on a straight line that slopes downwards from left to right. There is a perfect, inverse relationship.
- Zero Correlation (0.00): There is no linear relationship between the variables; data points are scattered randomly.
Correlations between 0 and $\pm 1$ indicate varying degrees of relationship:
- Weak Correlation: Data points are widely scattered around the trend line. The relationship is not strong. (e.g., correlation coefficient closer to 0, like $\pm 0.1$ to $\pm 0.3$).
- Moderate Correlation: Data points show some clustering around a trend line, but with noticeable scatter. (e.g., correlation coefficient around $\pm 0.4$ to $\pm 0.6$).
- Strong Correlation: Data points cluster closely around a trend line, indicating a strong relationship. (e.g., correlation coefficient around $\pm 0.7$ to $\pm 0.9$).
Spearman’s Rank Correlation
Spearman's Rank Correlation, denoted by $r_s$ or $\rho$ (rho), is a non-parametric method used to measure the degree of association between two variables based on their ranks rather than their raw values. It is particularly useful when data is ordinal or when the number of observations is small.
The formula for Spearman's Rank Correlation is:
$ r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $
Where:
- $r_s$ = Spearman's Rank Correlation coefficient
- $\sum D^2$ = Sum of the squares of the differences between the ranks of corresponding pairs of X and Y variables
- N = Number of pairs of observations (number of items)
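Steps for Calculation:
1. List the paired X and Y values in two columns.
2. Rank each variable separately, giving rank 1 to the highest value (XR and YR).
3. Find the difference between the two ranks of each pair (D), ignoring the sign.
4. Square each difference ($D^2$) and add the squares to get $\sum D^2$.
5. Substitute $\sum D^2$ and N into the formula.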
Example 2.9: Calculate Spearman’s Rank Correlation for the given scores in Economics (X) and Geography (Y).
Economics (X) | Geography (Y) |
---|---|
02 | 04 |
08 | 12 |
00 | 06 |
20 | 24 |
12 | 16 |
16 | 18 |
06 | 08 |
18 | 20 |
09 | 09 |
10 | 10 |
Answer:
The ranks (highest score = rank 1), rank differences, and squared differences are worked out below:
X (Score) | Y (Score) | XR (Rank of X) | YR (Rank of Y) | D (Difference in Ranks $|XR - YR|$) | D$^2$ |
---|---|---|---|---|---|
2 | 4 | 9 | 10 | $|9 - 10| = 1$ | 1 |
8 | 12 | 7 | 5 | $|7 - 5| = 2$ | 4 |
0 | 6 | 10 | 9 | $|10 - 9| = 1$ | 1 |
20 | 24 | 1 | 1 | $|1 - 1| = 0$ | 0 |
12 | 16 | 4 | 4 | $|4 - 4| = 0$ | 0 |
16 | 18 | 3 | 3 | $|3 - 3| = 0$ | 0 |
6 | 8 | 8 | 8 | $|8 - 8| = 0$ | 0 |
18 | 20 | 2 | 2 | $|2 - 2| = 0$ | 0 |
9 | 9 | 6 | 7 | $|6 - 7| = 1$ | 1 |
10 | 10 | 5 | 6 | $|5 - 6| = 1$ | 1 |
N = 10 | | | | | $\sum D^2 = 8$ |
Apply the formula:
$ r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 8}{10(10^2 - 1)} = 1 - \frac{48}{10(100 - 1)} = 1 - \frac{48}{10(99)} = 1 - \frac{48}{990} $
$ r_s = 1 - 0.04848\ldots \approx 0.95 $
The rank correlation coefficient is approximately 0.95, indicating a very strong positive correlation between the scores in Economics and Geography for this group of students.
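The whole of Example 2.9 can be reproduced with the sketch below. It assumes there are no tied scores within either variable, which holds for this data and is what the formula above expects. The function names `rank_descending` and `spearman_rs` are hypothetical.

```python
def rank_descending(values):
    """Rank values so that the highest gets rank 1.
    Assumes no ties within the list."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation: 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    rank_x = rank_descending(x)
    rank_y = rank_descending(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Scores from Example 2.9.
economics = [2, 8, 0, 20, 12, 16, 6, 18, 9, 10]
geography = [4, 12, 6, 24, 16, 18, 8, 20, 9, 10]

rs = spearman_rs(economics, geography)
print(rank_descending(economics))
print(rank_descending(geography))
print(round(rs, 2))
```

Squaring the rank differences makes their sign irrelevant, so the code does not need to take absolute values as the table does.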
Rank correlation is a good alternative when the number of cases is small. For larger datasets, calculating ranks can become cumbersome, and other correlation methods might be more efficient.
Exercises
This section contains exercises covering the calculation and interpretation of measures of central tendency, dispersion, and correlation, allowing students to practice and apply the statistical techniques learned in the chapter.